Not just NVIDIA: GPU programming that runs everywhere
If you’re doing computations on a GPU, NVIDIA and its CUDA libraries are the default. Some libraries, like PyTorch, do support AMD GPUs and Macs, but from the re-implementations of NumPy, SciPy, and Pandas in the RAPIDS project to Numba’s GPU support, NVIDIA has the best software support in the Python world.
Sticking to NVIDIA-specific software has some downsides, however:
- It won’t run on modern Mac laptops.
- Testing in CI is more difficult: you need custom runners that have NVIDIA GPUs.
- You can’t use any other GPUs you might have access to, like AMD GPUs.
What can you do if you want to use GPUs in a portable manner?
In this article we’ll cover one option, the wgpu-py library.
WebGPU, wgpu, and wgpu-py
WebGPU is an in-progress standard for writing portable GPU programs (or “shaders”) that run in web browsers. It’s still experimental, but Chrome and Firefox are building prototypes. Graphics programming is the major use case, but WebGPU also supports compute-only shaders.
Of course, not everything runs in a browser!
Which brings us to wgpu, a Rust library that re-purposes the WebGPU API for environments outside the browser.
Notably, it supports a variety of backends (Vulkan, GL, Metal, DirectX 12), with the effective result that it can run on:
- Modern GPUs on Linux or Windows.
- Modern Macs.
- CPUs, using the lavapipe CPU-based implementation of Vulkan.
And wgpu-py is a Python wrapper for wgpu.
Add this all up, and you can use wgpu-py to run GPU programs pretty much anywhere, from normal CI runners to your Mac laptop to, yes, an NVIDIA GPU.
An example: adding two NumPy arrays
To show wgpu-py at work, we’ll implement an example that adds two NumPy arrays together. The point here is not to write good GPU code, or even interesting code; it’s just to demonstrate the library at work.
import numpy as np
from time import time
INPUT1 = np.arange(0, 1_000_000, dtype=np.int32)
INPUT2 = np.arange(2, 1_000_002, dtype=np.int32)
assert INPUT1.shape == INPUT2.shape
Here’s the program we’ll be running on the GPU, implemented in the WebGPU Shading Language (WGSL), a new custom language developed for WebGPU:
SHADER = """
// Two read-only input buffers and one writable output buffer,
// exposed as bindings 0, 1, and 2:
@group(0) @binding(0)
var<storage,read> input1: array<i32>;
@group(0) @binding(1)
var<storage,read> input2: array<i32>;
@group(0) @binding(2)
var<storage,read_write> output: array<i32>;

// Each invocation adds one pair of entries; invocations run in
// workgroups of 50.
@compute
@workgroup_size(50)
fn main(@builtin(global_invocation_id) index: vec3<u32>) {
    let i: u32 = index.x;
    output[i] = input1[i] + input2[i];
}
"""
We can run this on a GPU using wgpu-py, which you can install with pip install wgpu:
from wgpu.utils.compute import compute_with_buffers

def run_on_gpu():
    # Map from input binding number to the relevant NumPy
    # array:
    inputs = {0: INPUT1, 1: INPUT2}
    # Map from output binding number to the length and type
    # of the output; "i" means signed 32-bit integer.
    outputs = {2: (len(INPUT1), "i")}
    # The shader's workgroup size is 50, so dispatch
    # len(INPUT1) // 50 workgroups to cover every element.
    result = compute_with_buffers(
        inputs, outputs, SHADER, n=len(INPUT1) // 50
    )
    # The result maps output binding numbers to memoryviews.
    return np.frombuffer(result[2], np.int32)
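One detail worth spelling out: the shader declares @workgroup_size(50) and we ask compute_with_buffers to dispatch len(INPUT1) // 50 workgroups, so the total number of invocations equals the number of array elements. A quick sanity check (a sketch reusing the names defined above; the WORKGROUP_SIZE constant is just for illustration) makes the arithmetic explicit:

WORKGROUP_SIZE = 50  # must match @workgroup_size(50) in the shader
num_workgroups = len(INPUT1) // WORKGROUP_SIZE
# Every element gets exactly one invocation only if the length
# divides evenly; otherwise the tail of the arrays would be skipped.
assert num_workgroups * WORKGROUP_SIZE == len(INPUT1)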
For comparison, here’s how we’d add the two arrays on the CPU using NumPy:
def run_on_cpu():
    return INPUT1 + INPUT2
Now we can run both versions:
from wgpu.utils import get_default_device
from time import time

device = get_default_device()
print("GPU:", device._adapter.request_adapter_info()["device"])

# Run once to prep everything without distorting timing:
run_on_gpu()

start = time()
for _ in range(10):
    gpu_result = run_on_gpu()
print("GPU mean elapsed:", (time() - start) / 10)

start = time()
for _ in range(10):
    cpu_result = run_on_cpu()
print("CPU mean elapsed:", (time() - start) / 10)

assert np.array_equal(gpu_result, cpu_result)
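As a side note, time.time() is not the most precise clock for benchmarking. The numbers below come from the loop above, but a variant using time.perf_counter (a sketch, not what produced the measurements) would be more robust:

from time import perf_counter

start = perf_counter()
for _ in range(10):
    gpu_result = run_on_gpu()
print("GPU mean elapsed:", (perf_counter() - start) / 10)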
Running on non-NVIDIA GPUs
This program can be run on the integrated GPU of an i7-12700K:
GPU: Intel(R) UHD Graphics 770 (ADL-S GT1)
GPU mean elapsed: 0.008639335632324219
CPU mean elapsed: 0.0013203620910644531
You can also run it on the integrated GPU of an ARM-based Mac:
GPU: Apple M1
GPU mean elapsed: 0.0035706758499145508
CPU mean elapsed: 0.00032808780670166013
In theory it should also be possible to run on a CPU by installing lavapipe, allowing you to run the code in normal CI runners.
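If you set that up, you may want to confirm which adapter was actually selected. Here’s a minimal check reusing the same adapter-info call as the benchmark code above; the exact device string lavapipe reports will vary, so treat any specific name as an assumption:

from wgpu.utils import get_default_device

device = get_default_device()
info = device._adapter.request_adapter_info()
# With lavapipe this should name a software rasterizer rather
# than a physical GPU.
print("Selected adapter:", info["device"])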
While the CPU version is faster in both cases, that comparison is pretty meaningless: this example was chosen for simplicity, not for usefulness or relevance to performance.
Considering the tradeoffs
Portability isn’t everything, and using wgpu-py has its downsides.
Even from my perspective as someone who knows very little about GPU programming, some obvious issues include:
- The WGSL language is still a draft standard, as is WebGPU in general. wgpu, and therefore wgpu-py, do support other shader languages, though.
- As mentioned above, sticking to NVIDIA hardware gives you a large number of pre-written libraries that won’t work elsewhere.
- If you’re not already familiar with this field, WGSL (or one of the other supported shader languages) is a new language you need to learn. Compare that to Numba, which lets you write CUDA programs with Python syntax (see the sketch after this list).
- A lowest-common-denominator language won’t let you take full advantage of your hardware.
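To make the Numba comparison concrete, here’s a minimal sketch of the same element-wise addition written as a Numba CUDA kernel; it only runs on NVIDIA hardware, and the kernel name and launch configuration are illustrative rather than taken from the article:

from numba import cuda
import numpy as np

@cuda.jit
def add_arrays(input1, input2, output):
    # Each CUDA thread handles one element, much like each
    # WGSL invocation in the shader above.
    i = cuda.grid(1)
    if i < output.size:
        output[i] = input1[i] + input2[i]

input1 = np.arange(0, 1_000_000, dtype=np.int32)
input2 = np.arange(2, 1_000_002, dtype=np.int32)
output = np.zeros_like(input1)
threads_per_block = 256
blocks = (input1.size + threads_per_block - 1) // threads_per_block
add_arrays[blocks, threads_per_block](input1, input2, output)

Note that it’s Python syntax throughout; there’s no separate shader language to learn.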
Nonetheless, the underlying wgpu Rust library is finding users in the real world. Visit the wgpu site’s Showcase section (scroll down) to see examples, albeit with a graphics-oriented slant.